ADDITIVE MODELS IN HIGH DIMENSIONS 
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Abstract. We discuss some aspects of approximating functions on high-dimensional 
data sets with additive functions or ANOVA decompositions, that is, sums of func- 
tions depending on fewer variables each. It is seen that under appropriate smooth- 
ness conditions, the errors of the ANOVA decompositions are of order 0(n m / 2 ) for 
approximations using sums of functions of up to m variables under some mild restric- 
tions on the (possibly dependent) predictor variables. Several simulated examples 
illustrate this behaviour. 



1. Introduction 

In this paper we want to understand some aspects of behaviour of additive predic- 



tive models |L0| for high-dimensional data sets. More specifically, we will deal with 
the problem of approximating a real- valued predictor function / defined on a domain 
fl and depending on variables Xi, x 2 , ■ ■ ■ ,x n , . . . by additive functions / a dd in fewer 
(say n) variables and of lower order of interaction, say m, having the form 

(1) /add(x) = f Q + fl(x 1 )+f 2 (x 2 ) + --- + f n (x n ) + 

fl,2\%U %2) + • • ■ + /ii,i 2 (^ii) 2^2) + " " ' fn-l,n\ x n-l: x n) + ' ' ' + 
/ii,i2,- ,i m \ x ii 1 x h ? ' ' ' 5 x im^) ' ' ' fn—m+1,— ,n\ x n—m+l > * * ' > x n) > 

where typically m <C n. Our aim will be to obtain easily verifiable upper bounds 
on the L2-norm of the remainder / — f a dd, as well as to try and understand their 
dependence on the dimension n of the domain Q. In doing so, we will move away 
from the condition of independence of the predictor variables Xi,x 2 ,..., replacing 
it with a milder restriction of the probability distribution being equivalent to the 
product of its marginals. Our approximation with additive functions is optimal (with 
regard to the mean square error) for independent variables, but not necessarily in the 
dependent case, where we obtain an upper error bound. At the same time, we are 
not yet ready to offer concrete algorithms for real very large datasets. 

In one of its forms, the phenomenon of concentration of measure P, [TR 113 says 



that every Lipschitz function on a sufficiently high-dimensional domain f2 is well- 
approximated by a constant function, that is, an additive function of the lowest 
possible order of interaction m = 0. However, as one would expect, the limitations of 
this result are such as to render it inapplicable in our situation: a reasonably good 
approximation requires the intrinsic dimension of a dataset to be prohibitively high. 

The most natural question is therefore, can one achieve a better approximation 
in lower (mid to high) dimensions by merely allowing additive functions of a higher 
interaction order m > 0? Even here the answer turns out to be negative: there 
exist functions for which approximation by constants is the best possible among all 
additive functions in the orders up to m — n — 1. This result makes it clear that the 
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only way to achieve a better approximation by additive functions is to impose further 
restrictions on the functions /. Our suggestion is to consider smooth functions and 
generalise the standard Lipschitz condition by requiring the L 2 -norm of the vector 
of all mixed derivatives of order k < m to be bounded above by a constant L m , 
independent of the dimension of the domain Q. 

Under such restrictions and an additional condition of independence of predictor 
variables x%, X2, ■ ■ ■ , x n , we develop a technique for obtaining approximating additive 
functions of a prescribed order and derive upper error bounds in the L 2 -norm (Section 
0.) Our results are illustrated by a series of examples in the last Section f|, in particular 
showing that the asymptotic rate of convergence of the theoretically derived error is 
accurate. 

Section |3.2| aims at relaxing the assumption of independence of predictor vari- 
ables. Recall that random variables independent if the probabil- 
ity distribution, p(x), can be written as the product of the marginal distributions, 
p(x) = Pi(xi)p2(x2) ■ ■ .p n (x n ). We replace this with the assumption which we call 
quasi-independence and which calls for the distribution of x to be equivalent to the 
product of its marginals. The Radon-Nikodym theorem then implies that the distri- 
bution YYi=iPi( x i) i s the product of p(x) with the Radon-Nikodym derivative ip(x), 
and the additive approximation obtained using the product measure (distribution) 
serves at the same time as an approximation with regard to the 'true' probability 
distribution, p(x). The upper error bounds in the L2-norm includes the derivative 
V>(ar). 

The research here is motivated by the observation that adaptive techniques like 
MARS m which estimate models of the form given in Equation (|l]) will produce 
models with predominantly lower order interactions. In practice, interactions with 
order higher than 5 are not used. Models of the type defined in Equation (H) have 
been called "ANOVA decomposition" in || as they generalise for real variables the 
models which are used in the analysis of variance (ANOVA). Applications of the 
ANOVA decomposition for the analysis of techniques for variance estimation can be 



found in and for the estimation of quadrature errors in [16] and [12|. The work 



here extends the previous work by providing estimates for the approximation errors 
of truncated ANOVA decompositions. In earlier work by one of the authors [|l[] it 
was seen how the concentration of measure may be exploited to get highly effective 
numerical differentiation procedures. Computational techniques for the determination 
of the ANOVA decomposition can for example be found in |2C], |8|, p~0f| . 

More generally, data mining || ||, [7| is being developed for the analysis of large 
data sets which appeared in business and science due to the fact that both data 
acquisition and data storage have become inexpensive because of the availability of 
cheap transducers and data storage devices. Typically, data mining applications 
lead to very large data sets of high dimension, and high-dimensional problems are 
intrinsically difficult as they are affected by the curse of dimensionality [fj, [TT|. Both 
queries and the identification of predictive models are very time-consuming. At the 
same time, it turns out that the effects of high dimensions are not only bad and some 
may be successfully exploited to lead to highly effective algorithms (cf. e.g. [|13|, Q). 
In the ideal case, high-dimensional data is just data which contains high amounts of 
information and these added amounts of information should intuitively lead to better 
algorithms. 
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2. Guiding observations 



2.1. The paradox of increasing distances. The first basic concept is that of 
an object w G H. Examples include shoppers of a retailer, insurance customers or 
variable stars. These objects have many properties, some of which are observable. 
The array of observed properties is the feature vector x. We assume here that xeK" 
but arrays containing other types of components and even of mixed types occur as 
well. In order to distinguish objects we require some quantitative notion of difference 
or similarity between objects. It seems reasonable to assume that the Euclidean 
distance between two feature vectors, given by 



provides information about the difference of the underlying objects. However, this 
leads straight to the first paradox of increasing distances: the typical distance between 
two objects grows as we add new features, that is, distance grows with the number 
of features n. In other words, the more one knows about the objects, the more 
different they seem to appear and ultimately, the difference may become infinite. 
While the increased difference seems reasonable, the unboundedness of the distance 
is not, as intuitively two objects are only "different to a certain point". Fortunately, 
this paradoxical growth of distances may be easily cured by either normalising the 
Euclidean distance or else by scaling the variables in such a way that, for example, 
the average distance between two feature vectors is 1. As an example, consider the 
Euclidean cube [0, l] n ; it is easy to see that the average distance between two randomly 
chosen vectors x, y G [0,1]" (the characteristic size of the cube ||) is 0(y/n), and 
thus a natural way to normalise the Euclidean distance is 



While this paradox is seemingly simple, it is necessary to consider, and it is important 
that the dissimilarity is normalised in the way suggested so as to put things in the 
proper perspective. 

The predictor functions / we are interested in will estimate the probability with 
which a certain statement about the object is true, and thus the range of / is, typically, 
the unit interval [0, 1]. Moreover, it is reasonable to assume that the function assumes 
close values for close values of parameters. Usually this condition is expressed by 
requiring / to be Lipschitz, that is, to satisfy 



where L > is a positive number called the Lipschitz constant. 

We will assume that the data points are drawn from a domain Q with respect to 
some probability distribution P. Thus, we view the domain Q = (Q, d, A, P) as a 
probability space equipped with a distance (an mm-space, cf. 0). 

2.2. Concentration of measure and approximation by constants. The sim- 
plest class of additive functions is that of zeroth order of interaction, k = 0, in which 
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case the approximating functions / a dd are simply constants. It turns out that even 
the approximation by constants admits a substantial theory if the domain is high- 
dimensional. Such approximation improves as dimension grows, which observation 
is at the core of the phenomenon of concentration of measure on high- dimensional 
structures. (See [T5], [TJJ] and numerous references therein.) 

Let (Q n )™ =1 be an infinite family of metric spaces equipped with probability mea- 
sures. Intuitively, the spaces fl n should be thought of as having asymptotically grow- 
ing dimension. Assume that the distances are so normalised that the characteristic 
size of each Q n is 0(1). Let / be a Lipschitz function on some space fl n , and assume 
for simplicity that the corresponding Lipschitz constant L — 1, that is, 

\f(x-y)\ < d(x-y) 

for all x, y G Q n . The concentration phenomenon refers to the observation that 
for many 'natural' families (f2 n ) as above, the probability that f(x) differs from its 
expected value by less than e > is at least 

(2) 1-C ie xp(-C 2 e 2 n), 

where C\,Ci > are constants only depending on the family of spaces (D, n )'^_ 1 in 
question. Intuitively, it means that a 'nice' function on a space of high dimension 
'concentrates' near one value. 

As an example, consider hyperspheres. The Euclidean n-sphere of unit radius is 
defined as 

S n := {x G R n+1 : ||x|| = 1}. 
The constants for the family (§ ri )^L 1 are 

Cl = \fl ' C% = \ 

Similar estimates with varying constants hold for the hypercubes (remember that the 
distance has to be appropriately normalised), the Euclidean spaces with the Gaussian 
measure, the Hamming cubes, the groups of unitary matrices, and so forth. (Loco 
citato.) 

Concentration of measure in particular enables one to explain the following, well- 
known, observation made long ago about large and multidimensional datasets. Con- 
sider an arbitrary object and its nearest neighbors. It can then be observed in practice 
and verified theoretically that with higher dimensions the distances between the point 
and its nearest neighbors become almost constant, so that the nearest neighbors all 
are close to a hypersphere around the point. In fact, the large majority of the points 
seems to be contained in a thin shell around the point. (See e.g. ||, f|].) This paradox 
is illustrated in Figure [l] where the hundred nearest neighbors of a point are displayed 
all from an i.i.d. set of normally distributed data points. To derive this paradox from 
the phenomenon of concentration of measure, it is enough to apply the bound (0) 



to the function f(x) = d(x*,x), where x* is the fixed (query) point. (See [HJ for 
details.) This paradox can not be cured, at least not if one wants to keep the same 
distance (similarity measure) between datapoints, and is the origin of many problems 
one faces with high-dimensional data. 

Returning back to the result captured by formula one concludes that the pre- 
dictor function / on a high-dimensional domain will be closely approximated by a 
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Figure 1. 100 nearest neighbors of a point from i.i.d., normally 
distributed data points in 2 dimensions (left) and projected using a 
random distance preserving mapping from 100 dimensions (right). The 
edges are defined by a Delaunay triangulation. 



constant function, if the dimension is large enough. However, how good is such an 
approximation from a practical point of view? It comes as no wonder that the di- 
mension required for the upper bound in @ to become genuinely small is extremely 
high. 

For example, suppose the data is uniformly distributed on the hypersphere of di- 
mension n of unit radius, Q = S n , and the predictor function / : S>™ — ► R is 1-Lipschitz 
and takes values in the range [—1, 1]. Let e > and suppose we want to approximate 
/ by a constant, M, in such a way that \ f(x) — M\ < e holds with probability > 1 — e. 
Even for the value of e — 0.2 (which is nowhere good enough) the minimal dimension 
required to achieve our goal is n = 41. For e = 0.1, the minimal dimension is n = 270. 
To obtain the accuracy e = 0.05, the dimension n = 5000 would suffice, but one can 
hardly expect a real dataset to have an intrinsic dimension of this sort, that is, to 
depend on five thousand independent parameters. Finally, to ensure that 

P{\f(x) - M\ < 0.01} > 0.99, 

which already seems to be a reasonably close approximation, one needs the dimension 
to be on the order of the astronomical (and unrealistic) n ~ 10 5 . 

It is clear from the above that approximation by constants is not good enough in 
the medium to high dimensions which is the case we are mostly interested in. The 
next natural question is therefore: will the approximation error bounds based on the 
concentration phenomenon and given by formula (0) improve automatically if one 
allows the approximation by additive functions of higher interaction order than zerol 

It seems quite natural that by significantly relaxing the restrictions on the class 
of approximating functions one gets better approximation bounds. Rather surpris- 
ingly, it is not the there exist functions on n-dimensional domains for which 
approximation by constants is the best possible among all additive functions with 
interaction orders k of up to n — 1. (Subsection fO| .) 
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In view of the existence of such examples, it seems in a sense unavoidable that one 
should impose additional restrictions on the predictor functions / to obtain better 
bounds on higher-order approximations with additive functions. We will now put 
forward such restrictions as we find most natural, in the hope that the reader is 
prepared to accept them as such. 

2.3. Our assumptions. Key to all approximation results are assumptions about the 
data set and the class of functions to be approximated. As we are interested in the 
asymptotic behaviour as the dimension becomes medium to large (which means in 
practice larger than 10), we actually are interested in a family of spaces, functions 
etc, parameterised by their dimension. We will assume that the feature vectors are 
distributed with a density p(x) which has a first moment 

E(x) = J xp(x)dx 

and a variance 

E(\\x- E{x)f) = c 

which does not depend on the dimension. 

For some of the theorems it will be required that the components of x are indepen- 
dent, that is to say the underlying data distribution satisfies 

n 

i=i 

However, for practical purposes this assumption is unrealistic, because in the context 
of data mining where one has many physical variables (n is large), one would expect 
that those variables could be highly correlated. In view of this, we will subsequently 
replace the condition of independence with a milder restriction for the product dis- 
tribution p(x) to be in the same measure class as the product distribution, 

n 

p(x) ~ JJpi(Xi). 
i=l 

We will refer to such random variables xi,X2, ■ ■ ■ ,x n as quasi-independent. 
The functions we consider are assumed to be Lipschitz-continuous, i.e., 

\f(x)-f(y)\<L\\x-y\\ 

for some constant L which is independent of the dimension. In the case of differen- 
tiable functions / this corresponds to the bound 




This condition is very natural and invoked frequently, it assumes that function values 
have similar sizes for points which are close. 

"Smoother" functions will be defined as functions satisfying the condition 

v / crf( X ) y 2 
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for some constant L m which is independent of the dimension n. While such smooth- 
ness definitions do always have a certain degree of arbitrariness and are difficult to 
check in applications, one can see, first, that this condition generalises the Lipschitz 
condition, and, second, that for the case of functions given as products 



/(*) = !!/. 



this bound follows from Lipschitz continuity for functions bounded away from zero 
as one can verify that 



fix 



,2(m-l) 




In particular, if f(x) > 1 then L m = L m . Similar bounds for the components of 
the decomposition from Equation ([I]) are used in 1^] for the analysis of quadrature 



formulas based on a reproducing kernel Hilbert space. We make here the small step 
to consider families of functions where the bounds on the derivatives are independent 
of the dimension in order to obtain an estimate of the approximation error behaviour 
as a function of dimension. 

So far we have assumed that the expected norm squared of the vectors x does 
not grow with dimension. However, in many practical applications this is not the 
case. Often, the features are normalised, say, so that they are all in [—1,1]. This is 
the normalisation we will mostly consider here if not mentioned otherwise. In this 
and many similar cases the variance grows proportional to the dimension n. Thus 
by dividing all the variables by \fn one gets to the previous situation again. If one 
invokes the Lipschitz condition in the transformed variables, one has for the gradients 
in the original variables the condition 



(3) E 



i=i 



2 

dxj J ~ n 



dfY < L* 



which is thus equivalent to the Lipschitz condition. For the higher order derivatives 
one gets the condition 



E 



\<i;<—<i m <n 



d m f(x) 



dx h ■ ■ ■ dx in , J n r 



2 



< 



The dependence of the bounds on the dimension n appears unnatural at first, however 
note that this is simply a consequence of the scaling of the variables, corresponding 
to Lipschitz-continuity in the unit ball. The alternative to the introduction of this 
scaling in the conditions would be to scale the variables so that the average distances 
would be independent of the number of variables. For convenience, we have here 
chosen to normalise all the X{ to [0, 1] and to scale the smoothness conditions. While 
these conditions are strong, they are not unusual, and allow the generalisation of the 
well-known limit theorems for 



n 



Xj, 

n 

8=1 
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of the average of n i.i.d. random variables. A nontrivial example, for which the 
conditions hold is given by 

f(x) = exp I - J^x?/n 

V i=\ 

A far-reaching but so-far implicit assumption is that all the variables Xj contribute 
in the same way to /. This is true for applications where the features X{ correspond 
to equally important parts or elements and occur when / is describing an aggregate 
of similar elements each characterised by one Xj. Examples include employees of a 
company and stars of a galaxy. An alternative situation is considered in [[14], [Tj| [T| 
by H. Wozniakowski and his collaborators. There the variables are given weights de- 
pending on their importance. These weights then enter the smoothness assumptions. 
One of the consequences of that choice is that in many cases the negative effects of 
the concentration effect discussed below can be avoided. Here, however, we consider a 
different situation of equally important variables and thus have to deal with the con- 
sequences of the concentration. The two examples of functions given above illustrate 
that functions like the mean which "change" with the dimension n are very natural. 



3. Best approximations with additive functions 
In this section we formulate the main results of this paper. 



3.1. Independent random variables. First the notation is established. Let (ft, A, P) 
be a probability denote a family of random variables and E(f | 

x ix , . . . ,x ik ) be the usual conditional expectations. Throughout this Subsection, we 
make a standing assumption that the random variables Xi,x 2 , ■ ■ ■ ,x n are indepen- 
dent, i.e., the density distribution is of the form 

n POO 

(4) p(x) = Y\_Pi( X i) With / Pi(Xi)dXi = 1. 

i=l 

The operator Di is defined as 

(Dif)(x) = f(x) - E(f | x u ... . . . ,x n ). 

Using the independence assumption on random variables, one gets a 'telescoping sum' 

n 

(5) f{x) = E(f)+ ^DiE(f\ Xl ,... , Xi ). 

i=l 

The terms of the sum can now be expanded in the same way as / and repeated 
application of these expansions provides a theorem which looks very much like Taylor's 
theorem: 
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Theorem 3.1. Let f be an integrable function on Q. Then for every natural m, 
1 < m < n: 

n 

f{x) = E{f) + D t E{f | Xi ) + D hD h E(f I x i2 ,x h ) + --- 

i=l l<i2<ii<n 



+ ^ 1 Di m -\ ' ' ' Dii-E(f | Xi m _ 1 j ■ ■ ■ j X^ ) 

1< 

+ ^ ] Di m ■ ■ ■ D^E^f I Xi, x 2 , ■ ■ ■ , Xi m , ajj m _ 1 , £i m _ 2 5 • • • ; x ii)- 



l<im<-<ii<n 

Proof. The proof uses induction. First, the case m — 1 is just Equation ([|). 

If the equality holds for m = k — 1 then the first — 1 terms are the same as for 
m = k and only the last term needs further expansion. In this term each summand 
is a function of xi, . . . ,x ik _^i and from Equation (||) one obtains: 

^ ] Di k _ 1 • ■ ■ D^E^f | Xi, . . . , Xi k _ 1 , Xi k _ 2 , • • • , ) 
1 <*fc-i<---<n<« 

= ^ ] Di k _ 1 ■ ■ ■ D^E^f | Xj fc _ 1 , . . • , Xjj) 

l<jj,_ 1 <---<il <n 

+ ^ ^ ■ ■ ■ E^E^f | #1, . . . , Xj fc _ 1 , . . . , a^). 

l<i fc <---<il<n 

Replacing the last term in the equation for the case m — k — 1 with the right-hand 
side of this equation leads to the equation for m = k. □ 

A similar decomposition for the special case of m = n has been proved in || where 
the theorem is called Decomposition Lemma. 

Next we introduce the space of L 2 functions which are sums of functions only 
depending on k variables each as: 

L 2 ,k ■= {g(x) = Y 9ii,...,i k ( x in--- > x h) e L 2>- 

ii,... ,i k 

(Note that L 2 ^ are closed, which follows from Theorem |3.2| below.) Now we introduce 
the operator P m : L 2 — > L 2 ^ m such that 



(P m f)(x) = E(f) +J2 D Mf I x t ) + Y D » D h E (f I x i» x h) + ■■■ 

i=l l<i2<ii<n 
+ Y D im'-- D h E U I X im ,... ,X h ) 



l<i m <---<ii<n 

and the remainder operator R m : L 2 — > L 2 ^ m with 

(R m f)(x) = Y D im'-- D h E (f I zi,... ,x im ,... ,x h ). 



l<im<---<ii<n 



From theorem [D] one then gets / = P m f + R m+ if. 

Theorem 3.2. The operator P m is an orthogonal projection, and 

E((f - P m f) 2 ) < E((f - gf), for all g e L 2 , m and f G L 2 . 
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Proof. The statement that P m is a projection, i.e., P^ = P m , is equivalent to showing 
that R m+ if is zero for / e L2,m- This follows directly from the fact that any function 
/ for which the function values do not depend on the variable x,i one has Dif = 0. 
As any / e L<i.m consists of a sum of functions which depend only on m variables 
and all the terms in R m+ i contain D im+1 ■ ■ ■ D i± there is at least one i k for which any 
particular term in the expansion of / does not depend on x ik and thus -R m+ iL 2 ,m = 0. 

If p(x) is a product distribution as assumed above, then J pi(xi)dxi = 1 for every 
projection measure and for any g(xi, . . . , x n ) one has 

Dig(xi, . . . ,x n )pi(xi)dxi 

g{x x , ... ,x n )- [ g(xt, ... , s, x i+1 , ... , x n )pi(s)ds ) p^x^dxi 
=0. 

Now as for each m-tuple ki < • • ■ < k m and for each (m + l)-tuple i m +i < • • ■ < %\ 
one has at least one ij which is different from all the k s then 



A m+1 • • • D h E(f \x u ... ,x im} x im _ 1} ... ,x h )g(x kl ,... ,x km )p(x)dx 
= / ( / A m+ i • ■■D il E(f | •■■ )p i .{x ij )dx i . J g(x kl , . . . ,x krn ) Y[p t (x t )dx t 
=0. 

This shows that the error term R m+ if is orthogonal on L 2im which implies that P m 
is an orthogonal projection into £2,™ and from this the minimisation characterisation 
follows. □ 



The orthogonality of all the components of the decomposition is shown in |16| for 
the case of the uniform distribution. 



Now we observe that R n = and thus the decomposition as in Theorem |3.1| termi- 
nates and so 

n 

f(x) = E(f) + J2 E D im --.D h E(f\x im ,... ,x h ). 

m=l l<i m <—<ii<n 

If all the variances of these terms exist, one has from the orthogonality of this de- 
composition 

n 

var(/) = E E E (( D *~ ■ ■ ■ ^ W I x im ,... , x h ) ) 2 ) . 

m=l l<i m <---<ii<n 

Error estimates are obtained for different iable functions, one may also get bounds 
based on Lipschitz constants. We introduce the (marginal) cumulative distribution 
function 



/X 
Pi(s)ds 
-00 



and the kernel 

ki(xi,ti) = Pi{ti) H{ti Xi) 



ADDITIVE MODELS IN HIGH DIMENSIONS 11 

where H(x) is the Heaviside function, i.e., H(x) = 1 for x > 1 and H(x) = for 
x < 0. Using integration by parts one gets for differentiable /: 

f°° df 

(6) Dif(x) — J ki(xi : ti)-^j-(xi, . . . , tj, Xi + i, . . . ,x n )dti. 

Now let giiti) : = §f(xi, . . . , Xi-i,U, Xi+i, . . . , x n ) then the expected value squared is 
E((Dif(x)) 2 ) = / kiixi^ijkixi^ijgiit^giis^piix^dtidsidxi. 



Now let 

(7) Gi(*i,t 2 ) := min (P^ a )(l - P(t b )) 

a, 6=1, 2 

and 

(8) 7 := J maxGi(t,s)dtds. 

Then one gets by integration by parts and from the Cauchy-Schwarz inequality: 

n 

5>((A/0r)) 2 )<7£ 2 
i=i 

if / is Lipschitz continuous with constant L. 

For the case of m interactions we first define the seminorm |/| m by 



l<U<--<im<n 

and then obtain: 

Theorem 3.3. Let i? m 6e defined as in 

(Rmf) (x) = D i m --- D ii E if\xi,...,x im ,...,Xi 1 ). 

l<j m <-<ii<n 

T/ien one /ias /or t/ie mean squared error bound 

(9) £((iW) 2 ) < 7 m |/lm- 



Proof. It is shown the same way as in an earlier theorem that all the terms of the 
sum defining R m f are orthogonal and so 

E((R m f) 2 ) = E ((^ • ' ' ki, • • • , x Jm , . . . , XiJ) 2 ) . 

l<ii<---<i m <n 

For simplicity set 
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and application of Equation ffl) and similar reasoning as for the case m = 1 gives 



E((R m f) 2 ) = 

/m 
gi 1 ,...,i m (t il , . . . , *i m )^n,...,i m (si 1 , . . . , Sj m ) J^GV,- {sijiUj) dsdt 

< |/|m 7m 

where 

max TfGi.fsi.jtiJcZsdt. 

l<h<-<i m <n ll 7 v J 3/ 

Now one can see that 7 m < 7 m and from this the claimed bound follows. □ 
Consider the case 



Jr, 



n 
i=l 

for a fixed g which are independent of n. For example, let the x be uniformly dis- 
tributed in the unit hypercube. Then the constant 7 is independent of n. Further- 
more, in Section |] it was suggested that the appropriate smoothness restriction on / 
is 

T 2 

I ,,2 < 

n m 

From the previous theorem one can in this case conclude that 

For example, if q is the uniform distribution on [—1, 1] the constant 7 can be computed 
explicitly and one gets 7=3 and so 

E{{R m f) 2 )< U 



3 m n m 

In the case where q is the normal distribution with expectation and variance 1 one 
gets 7 = 0.516 (rounded) and thus 

E{{R m ff) < 

The same bound is obtained if the components of x are i.i.d. normal with a variance 
such that i?(||x|| 2 ) = 1 and \f\ m < L m . 

3.2. Quasi-independent random variables. As we have already noticed, the 
assumption on independence of feature vectors is not realistic for most practical data 
sets. Here we propose one way to overcome this difficulty and get an approximation 
which 

1. Satisfies a similar error bound as the one for the independent variable case. 

2. Has an error of the same order as the best least squares fit. 
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Let ji\ and ji2 be measures on the same sigma-algebra of sets. Recall that //i is 
absolutely continuous with regard to \i2 if \ii{A) = always implies fii(A) = 0, that 
is, every /i 2 -null set is also a /ii-null set. The Radon-Nikodym theorem then implies 
that (Ai = ip(x)H2, where ip(x) = d[i\jd[i2 is a measurable function called the Radon- 
Nikodym derivative. The measures fii and \ii are equivalent, or in the same measure 
class, if they are absolutely continuous with respect to each other, that is, have the 
same null sets. 

In our context, say that the random variables xi,X2,--- ,x n on the probability 
space (Q, A, P) are quasi-independent if the probability distribution p(x) is in the 
same measure class as the product of marginal distributions: 



p(x) ~Y[pi 



1=1 



Denote by P® the product measure on X with the product distribution above, and 



let 



ip(x) 



dP® 

IF 



be the Radon-Nikodym derivative of the product measure P® with regard to the 
underlying measure P on the data set. Then one has 



(10) 



I^PiO^i) = ip(x)p(x) 



i=l 



Given a predictor function / on X, let P® be the orthogonal projection defined 
as in Subsection [3~T1 , but using the product measure P® = ip(x)P rather than the 
original probability distribution P. Then P®(/) gives an approximation of / by a 
sum of additive functions of the order of interaction < m. Denote 

p<8> t — f _ f 

Let E® denote the expected value with regard to the measure P®. Then for every 
random variable y on X, 

E(y) = E®(yij- 1 )<E®(y)\\ij- 1 \\ oo , 

where W'W^ denotes the Loo-norm. 

Remember that the constant 7 only depends on the marginal distributions Pi(x), 
and so it remains the same no matter which of the two measures, P or P®, we are 
considering, cf. (^) and Now Theorem |3.3| leads to the following estimate. 

Theorem 3.4. The square error bound of the additive approximation satisfies 



E{{Rlff ) < Y 



dP 



dP ri 



l/l 



Proof. Because of the equivalence of the measures the error in the original measure 
is bounded by 

dP 



dP® 



E ((RmfY 



and Theorem |3.3| for the measure P® then gives E®((R m f) 2 ) < 7 m |/|, ; 



□ 
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Finally, the approximation does also provide a bound for the error of the best 
approximation in L^^m- 



Theorem 3.5. 

(12) 

•where k 



min E((f - gf) < E((R®f) 2 ) < K min E((f - gf 



II dp II 


dP® 


II dP® lloo 


dP 



Proof. The lower bound holds by definition. For the upper bound we first use the 
property that the measures are in the same class to get: 

dP 



E{Rlf) < 



dPt 



Then we note that is a best approximation with respect to the norm defined by 
the product distribution by theorem |3]2|. Thus one has for any g e L 2 , m : 

E®(R®f)<E®((g-ff) 

and, as the measures are in the same class: 

dP^ 



E®((g-ff)< 



dP 



E((g-fY 



and combining these inequalities and taking the minimum over g provides the desired 
estimate. □ 

This process does not lead directly to a computational procedure in the case of 
dependent variables. We hope to discuss such a procedure in a consequent paper. 

3.3. Extensions. We have so far assumed that / depends on exactly n variables 
X\, . . . ,x n . Again in practice, any response variable Y is typically only partially 
described by a function of the predictor variables and a large proportion (in particular 
in data mining applications) of Y remains unexplained by f(xi, . . . , x n ). One way to 
model this situation is to assume that there are actually N > n predictor variables 
and by only limiting the approximations to the first n another error is introduced. 
The error of this (truncation) approximation is 

T n f ■= f(xi, ■■■ ,x N )- E(f\x u ... ,x n ). 

In this case the error of the approximation P m>n (previously called P m ) has to include 
this term as well and thus the total error is now 

E m ,nf P^m,nf T n f 

where R m ,n (previously R m ) is the error term of the ANOVA decomposition of 
E(f\xi, . . . ,x n ). These two error terms are orthogonal in the case of independent 
variables and one gets for the expected error squared: 

E{{E m>n ff) = E((R m>n f) 2 ) + E{{T n ff) 

Now the approximation in the space of functions of n variables is the best possible 
and so is the additive approximation. Consequently, by increasing n the error is never 
increased. Clearly, if the error of the proposed additive model is to be small then 
both the terms T n f and Rm,nf need to be small. In many practical cases, however, 
the term T n f cannot be made small and thus major portions of / are beyond our 
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Figure 2. n m l 2 times the square root of the average errors squared 
of a constant, additive and interaction approximation of f(x) = 

ex p(-Er=i^ 2 /^)- 

controll. This, however, does not mean that the approximation P m ^ n is of no practical 
use and being able to controll a portion of the variation of / can lead to commercial 
and scientific benefits. 

Note that the smoothness condition for this case is really a smoothness condition for 
E(f\xi, . . . , x n ). An interesting question which nevertheless has not been investigated 
here is how the smoothness of the underlying function f(x±, . . . ,xn) determines the 
smoothness of the projection E(f\xi, . . . ,x n ). 

4. Examples 

In this section we will look at a few examples in more detail and we will investigate 
how well the theory of the previous sections applies. In the examples we will be 
considering approximations with constant functions, with additive functions, and with 
functions with second and third order interactions. The order of the interactions is 
m — 1 where m has the values 1,2,3 and 4 respectively. 

4.1. Example 1: uniform distribution on hypercube. For the uniform distri- 
bution on the hypercube [—1, l] n one has 7 = 1/3. As a function to approximate we 
choose 

f(x) = exp -^2x 2 Jn 
V i=l 
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Figure 3. Square root of the average squared errors of a constant, 
additive and interaction approximation of f(x) = (1 + sin(xi))(l + 
sin(x2))(l+sin(x3)) for normally distributed data with finite E(^2 i xf). 



From this one gets 




Thus one can choose the Lipschitz constant to be 




and consequently the bound from the previous section is 

Am 

E(R m f 2 ) < : — . 

For practical error estimates this bound is slightly too pessimistic. The order of 
convergence in n is accurate. This can be seen from the results of a simulation which 
are in Figure || where the average errors squared have been multiplied by n m//2 in 
order to confirm the 0(n~ m ^ 2 ) behavior. 

4.2. Example 2: Function of three variables on data with bounded E xf). 
Sometimes, functions are only dependent on a few of the variables. If the data is 
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Figure 4. ^"V 2 times square root 
a constant, additive and interaction 
sin(xi))(l + sin(x2))(l + sin^)) for 
finite £Q>?). 



of the average squared errors of 
approximation of f(x) = (1 + 
normally distributed data with 



equally distributed over many variables such that they have a uniformly (in the di- 
mension) bounded expected squared norm then the values of any component will 
concentrate around zero. This is illustrated in this example. Here the function con- 
sidered is 

f(x) = (1 + sin(xi))(l + sin(x 2 ))(l + sin(x 3 )). 

The data points are assumed to be i.i.d. normally distributed and in order to obtain 
the finite expected value of ^ x\ the variances of each component is a = 1/n. The 
error of the third and higher order interaction approximations is zero, for the lower 
order approximations see Figure [3] for the expected squared error. The theory again 
predicts asymptotic behaviour of the error of 0(n~ m ) which is confirmed by the 
simulation result displayed in Figure f|. 

4.3. Example 3: Approximation obtained with MARS. The theory and the 
examples so far have illustrated the best possible approximation with additive and 
interaction models. As in the first example the function 

f(x) = exp ^-X^V™! 

shall be approximated. We use 1000 data points uniformly distributed on [— 1, l] n 
and will let the dimension vary between 1 and 10. No random noise was added to 
f(x). The approximations are computed with the code "MARS" by J.Friedman ||. 
This code allows the specification of the maximal order of interactions and we have 
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Figure 5. Square root of the average squared errors of a constant, 
additive and interaction approximations of f(x) = exp(— ||x|| 2 /n) for 
data uniformly distributed on [— 1, l] n . Approximation obtained with 
the MARS code. 



investigated the approximations obtained for constants, additive functions and 2nd 
and 3rd order interaction models. The approximation uses tensor products of piece- 
wise linear functions and the basis functions used are products of functions of the 
form 

b{ Xi ) = (±( Xi - 0)+ 

where (x)+ := max(x, 0). The total number of basis functions allowed needs to be 
specified and we chose here 4-n m as an upper limit such that basically each term in the 
ANOVA decomposition may have 4 basis functions on average. In Figure |5| one can see 
that allowing higher order interactions lead to better approximations and also that in 
higher dimensions the approximations have a tendency to get better with dimension. 
One can get an optimal approximation if the number m of interactions allowed equals 
the dimension and thus one might expect a growth in error for dimensions close to m. 
There is also a problem that allowing too many basis functions of too high interaction 
may make the possible models too complex and thus introduce instability for small 
enough data sizes. The effect of allowing higher order interactions does not necessarily 
mean that MARS will finally select terms with higher order interactions at all. All 
these effects have to be taken into account when one interprets Figure |5]. 



4.4. Situations where constants give the best additive approximation. The 

functions whose best additive approximation is by constants (and whose existence was 
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claimed in Section ^) are those possessing high degree of symmetry. Without entering 
into technical details, let us mention two examples which seem intuitively clear. 

Let Q = (Q, A, P) be a probability space which is a domain in M. n and is such that 
the variables independent. 

Example 4.1. Consider the function 

/(x) = X X X 2 "'X n , 

defined on the hypercube which is symmetric about the origin, for example Q = 
[—1,1]™. It can be proved that the best additive approximation in the order of inter- 
action n — 1 is that by zero function. 

Notice that in accordance with our philosophy the function / has to be normalised 
by the factor of -, in order to keep its Lipschitz constant bounded by 1. The resulting 
function 

/i(x) = -xix 2 ■■■Xn 
n 

on the same cube assumed pretty small values: its maximum is just -, and thus one 
may argue that the approximation by zero function is not bad at all. 

The next example is somewhat stronger. 



Example 4.2. Denote by a usual bell-shaped function supported on the interval 
[— 2, |], that is, a C°° function taking values between and 1, which is identically 
zero outside of [— |, |], takes a positive value at 0, is monotone on each of the intervals 
[— 1,0] and [0, |], and satisfies 0^(0) = for all natural n. Let us also assume that 
0(0) = J. 

For a vertex, e = (ei,£2, • • • ,£ n ), of the cube [0, l] n define the parity of e as the 
number of ones among the coordinates modulo 2: 



E 



£j mod 2. 
i 



Now set for every x e [0, l] n 



/(x):= (-l) |£| <Mlk-x||), 

££{0,1}" 

where ||-|| denotes the Euclidean distance. A moment's thought shows that / is 
a well-defined C^-function assuming values in the interval between — | and |, in 
particular if x = e is a vertex, then f(e) = ±~ depending on the parity. Again, 
one can show that the above function admits no better additive approximation in all 
orders of interaction up to n — 1 inclusive than that by the zero function. 

Notice that the normalisation of the function / aimed at keeping the Lipschitz 
constant of the order 0(1) leads to the function 

/i(x) = -W(x), 

whose maximal values reach ttt=. 
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5. Conclusion 

In our paper we have attempted to perform initial analysis of the problem of approx- 
imating a predictor function on a high- dimensional dataset with additive functions 
allowing for interactions of a lower order. We are interested in the specifics of medium 
to high dimensions. The proposed model makes what we believe to be reasonable as- 
sumptions, from the modeling viewpoint, on the function to be approximated (the 
normalisation conditions and 'higher-order smoothness conditions'). We argue that 
some conditions of this kind are to be imposed in order to obtain approximation re- 
sults: we exhibit examples of Lipschitz functions in n variables for which the best 
additive function approximation of order of interaction n — 1 is a constant. Under 
the proposed conditions, we derive from a Taylor-type theorem upper bounds on the 
approximation errors. The results are illustrated on examples and compared to the 
results obtained using the MARS software package. The examples confirm that the 
asymptotic order of our error bounds is right. 
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